---
title: "Heating System Renovation"
author: "Aiden Belt"
date: "2023-06-04"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
library(readxl)
```
```{r load_data}
df <- read_excel("ProjectData_belt.xlsx", sheet = "Clean Data")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=250}
-----------------------------------------------------------------------
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
This dataset covers meter-evaluated energy-efficiency projects for existing residential homes in New York State from 2007 to 2012. We will examine the variables in the dataset to determine which ones help predict changes in energy efficiency in these homes.
#### The Data
This dataset has 2696 rows and 20 variables.
#### Data Sources
The Devastator. (n.d.). Residential Home Energy Efficiency [Dataset].
### The Data
VARIABLES TO PREDICT WITH
* *home_size*: The size of the home in which the project was completed.
* *home_vol*: The volume of the home in which the project was completed.
* *num_units*: The number of units in the home in which the project was completed.
* *home_year*: The year the home in which the project was completed was built.
* *total_cost*: The total cost of the project.
* *contractor_incentive*: The incentive given to the contractor for completing the project.
* *total_incentive*: The total incentives given for the project.
* *elec_base_kwh*: The baseline electric usage in kWh.
* *gas_base_mmbtu*: The baseline gas usage in MMBtu.
* *elec_report*: The reported electric usage in kWh.
* *gas_report*: The reported gas usage in MMBtu.
* *elec_eval_save_kwh*: The evaluated electric savings in kWh.
* *gas_eval_save_mmbtu*: The evaluated gas savings in MMBtu.
* *National_Grid*: Whether the customer subscribes to National Grid.
* *National_Fuel_Gas*: Whether the customer subscribes to National Fuel Gas.
* *NYSEG*: Whether the customer subscribes to NYSEG.
* *Orange_Rockland*: Whether the customer subscribes to Orange & Rockland.
* *Rochester_Gas_Electric*: Whether the customer subscribes to Rochester Gas & Electric.
VARIABLES WE WANT TO PREDICT
* *est_y1_save*: The estimated first year energy bill savings in dollars.
* *cust_type*: The type of customer the project was completed for, market (1) or assisted (0).
Data {data-orientation=rows}
=======================================================================
Column
-----------------------------------------------------------------------
### Organize the Data
Organizing data can also include summarizing data values in simple one-way and two-way tables.
```{r, cache=TRUE}
summary(df)
```
All variables that are not pertinent to the dependent variables, such as zip codes, locations, and index columns, have been removed. The summary shows that some variables have many missing (NA) values.
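The `summary()` output reports missingness only variable by variable; a per-column count of NA values makes it explicit at a glance (a minimal sketch, using the `df` loaded above). Note that `lm()` silently drops incomplete rows by default (`na.action = na.omit`).

```{r, eval=FALSE}
# Count missing (NA) values in each column of the cleaned data
colSums(is.na(df))
```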
Column
-----------------------------------------------------------------------
### Transform Variables
cust_type is 1 if the customer is a market customer and 0 if assisted; it will be transformed into a categorical variable (a factor). Each of the provider indicators will likewise be converted to a factor, with 1 = true and 0 = false.
```{r, cache=TRUE}
df <- mutate(df,cust_type=as.factor(cust_type),
National_Grid=as.factor(National_Grid),
National_Fuel_Gas=as.factor(National_Fuel_Gas),
NYSEG=as.factor(NYSEG),
Orange_Rockland=as.factor(Orange_Rockland),
Rochester_Gas_Electric=as.factor(Rochester_Gas_Electric)
)
```
#### cust_type (Market or Assisted?)
```{r, cache=TRUE}
select(df, cust_type) %>%
  table() %>%
  as_tibble()
```
Data Visualization #1
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### cust_type Market(1)/Assisted(0)
```{r, cache=TRUE}
select(df, cust_type) %>%
  table() %>%
  as_tibble() %>%
  ggplot(aes(x = cust_type, y = n)) +
  geom_col()
```
About 73% of the customers are market customers. Looking at the potential predictors of cust_type, we expect the strongest relationships to be with total_incentive and home_size.
Column {data-width=500}
-----------------------------------------------------------------------
### Explore Predictors
```{r, cache=TRUE}
ggpairs(select(df, cust_type, home_year, contractor_incentive, home_size, home_vol, num_units, total_incentive))
```
Data Visualization #2
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### est_y1_save
```{r, cache=TRUE}
ggplot(df, aes(est_y1_save)) + geom_histogram(bins=20)
```
The largest concentration of values is around $0–$1,000, and the distribution is skewed right. Among the predictors, home_year and num_units show the two strongest relationships with est_y1_save.
Column {data-width=500}
-----------------------------------------------------------------------
### Explore Predictors
```{r, cache=TRUE}
ggpairs(select(df, est_y1_save, home_size, home_vol, num_units, home_year, elec_base_kwh, gas_base_mmbtu))
```
est_y1_save Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict Estimated Savings
For this analysis we will use a linear regression model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
est_y1_save_lm <- lm(est_y1_save ~ . -cust_type,data = df)
summary(est_y1_save_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(est_y1_save_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(est_y1_save_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-down")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(est_y1_save_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-down")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
knitr::kable(summary(est_y1_save_lm)$coef, digits = 3) #pretty table output
summary(est_y1_save_lm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(est_y1_save_lm))[,4])
out <- coef(summary(est_y1_save_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(est_y1_save_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, we determine that some predictors are not important in predicting the estimated savings, so a pruned version of the model is created by removing the predictors that are not significant.
Row
-----------------------------------------------------------------------
### Predict Estimated Savings: Final Version
For this analysis we will use a pruned linear regression model. We removed elec_eval_save_kwh, elec_base_kwh, elec_report, home_size, gas_report, gas_base_mmbtu, gas_eval_save_mmbtu, home_vol, and NYSEG.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
est_y1_save_lm <- lm(est_y1_save ~ . - elec_eval_save_kwh - elec_base_kwh - elec_report
                     - home_size - gas_report - gas_base_mmbtu - gas_eval_save_mmbtu
                     - home_vol - NYSEG, data = df)
summary(est_y1_save_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(est_y1_save_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(est_y1_save_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-down")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(est_y1_save_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-down")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(est_y1_save_lm)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(est_y1_save_lm))[,4])
out <- coef(summary(est_y1_save_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(est_y1_save_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
Examining the residual plots for this model, we can see some issues with our data. The high values at the right of the Q-Q plot could indicate non-linearity or outliers, and the same three points stand out in the residuals-vs-fitted chart. We could transform the data or try some additional models to improve the fit.
Removing the predictors that did not help with prediction of the estimated savings lowered our adjusted R-squared and increased our RMSE.
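One candidate transformation, sketched here only (it was not fitted above, and keeping the pruned predictor set is an assumption), is to model the log of the savings, since est_y1_save is strictly positive and right-skewed:

```{r, eval=FALSE}
# Sketch only: refit the pruned model on a log scale. Coefficients then
# describe approximate proportional changes in estimated savings.
log_lm <- lm(log(est_y1_save) ~ home_year + total_cost + National_Fuel_Gas +
               Orange_Rockland + Rochester_Gas_Electric + National_Grid +
               cust_type + num_units + contractor_incentive + total_incentive,
             data = df)
plot(log_lm, which = c(1, 2))  # recheck residuals vs fitted and the Q-Q plot
```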
The following table shows the direction of each predictor's effect on the estimated savings.
```{r, cache=TRUE}
#create table summary of predictor changes
predchang <- tibble(  # data_frame() is deprecated in favor of tibble()
Variable = c('home_year', 'total_cost','National_Fuel_Gas1','Orange_Rockland1','Rochester_Gas_Electric1','National_Grid1','cust_type1', 'num_units', 'contractor_incentive', 'total_incentive'),
Direction = c('Decrease','Increase','Decrease', 'Decrease','Decrease','Decrease', 'Decrease', 'Increase', 'Increase', 'Increase')
)
knitr::kable(predchang) #pretty table output
```
cust_type Analysis {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict Customer Type
Row
-----------------------------------------------------------------------
### Analysis Summary
Examining these models, the neural network predicted the validation data most accurately. The decision tree also performed well on the validation data, and its structure makes it easy to see how it arrived at its predictions. Total incentive and total cost were the two largest determining factors of whether a customer was a market or an assisted customer.
Conclusion
=======================================================================
### Summary
In conclusion, our predictors can accurately classify what type of customer a person is, but are less successful at predicting the estimated first-year savings.